Abstract

We present a novel detection method using a deep convolutional
neural network (CNN), named AttentionNet. We cast an object
detection problem as an iterative classification problem, which is
the form best suited to a CNN. AttentionNet provides quantized weak
directions pointing toward a target object, and the ensemble of
iterative predictions from AttentionNet converges to an accurate
object bounding box. Since AttentionNet is a unified network for
object detection, it detects objects without any separate models,
from object proposal to post bounding-box regression. We evaluate
AttentionNet on a human detection task and achieve state-of-the-art
performance of 65% (AP) on PASCAL VOC 2007/2012 with only an
8-layered architecture.

1. Introduction 

After the recent advances in deep convolutional neural networks
(CNNs), CNN-based object classification methods in computer vision
have reached human-level performance on ILSVRC classification:
top-5 errors of 4.94%, 6.67% [22], and 6.8%, against 5.1% for a
human. These successful methods, however, have a limited range of
application, since most images are not object-centered. Thus, the
current research focus in visual recognition is moving towards
richer image understanding such as object detection or pixel-level
object segmentation. Our focus lies in the object detection problem.

Although many researchers have proposed various CNN-based
techniques for object detection, it is still challenging to estimate
object bounding boxes beyond mere object existence. Even the most
successful framework so far, Region-CNN, which reported top scores
in ILSVRC’14, is relatively far from human-level accuracy compared
to classification. One major limitation of this framework is that
the proposal quality highly affects the detection performance. On
another side of CNN-based detection, regression models are also
applied to detection

*This work was done when he was in KAIST. He is currently working 
in Adobe Research. 



Figure 1. Real detection examples of our detection framework. 
Starting from an image boundary (dark blue bounding box), our 
detection system iteratively narrows the bounding box down to a 
final human location (red bounding box). 

but direct mapping from an image to an exact bounding box is
relatively difficult for a CNN. We believe there is room to modify
and improve the use of a CNN as a regressor.

In this paper, we introduce a novel and straightforward detection
method by integrating object existence estimation and bounding box
optimization into a single convolutional network. Our model is
along similar lines to the detection-by-regression approach, but we
adopt a successful CNN-classification model rather than a
CNN-regression model. We estimate an exact bounding box by
aggregating many weak predictions from the classification model,
much as an ensemble method combines many weak learners to produce a
strong learner. We modify a traditional classification model into a
suitable form, named AttentionNet, for estimating an exact bounding
box. This network provides quantized directions pointing toward a
target object from the top-left (TL) and bottom-right (BR) corners
of an image.

Fig. 1 shows real detection examples of our method. Starting from
an entire image, the image is recursively cropped according to the
predicted directions at TL/BR and fed to AttentionNet again, until
the network converges to a bounding box fitting a target object.
Each direction is weak, but the ensemble of directions is sufficient
to estimate an exact bounding box. The difficulty of estimating an
exact bounding box at once with a regression model is solved by a
classification model. For multiple instances, we also place our
framework in the sliding window paradigm, but introduce an efficient
method to cope with them.

Compared with the previous state-of-the-art methods, a single
AttentionNet does everything, yet yields state-of-the-art detection
performance. Our framework does not involve any separate models for
object proposals or post bounding box regression. A single
AttentionNet 1) detects regions where only a single instance is
included, 2) provides directions to an instance in each region, and
3) corrects mislocalizations, as bounding box regression does.

The AttentionNet we demonstrate in this paper is not yet scalable
to multiple classes. Extension of this method to multiple classes
with a single AttentionNet is ongoing. We therefore demonstrate
single-class object detection on public datasets to verify the
strength of our model. We primarily evaluate our detection framework
on the human detection task and extend it to a non-human class.

Contributions. Our contributions are threefold:

1. We suggest a novel detection method, which estimates an exact
bounding box by aggregating weak predictions from AttentionNet.

2. Our method does not include any separate models such as the
object proposal, object classifiers and post bounding box regressor.
AttentionNet does all of these.

3. We achieve state-of-the-art performance on single-class object
detection tasks.

2. Related Works 

Object detection has been actively studied for the last few
decades. One of the most popular approaches is part-based models,
due to their strength in handling pose variations and occlusions.
The Deformable Part Model (DPM), proposed by Felzenszwalb et al., is
a flexible model composed of object parts combined with a
deformation cost to manage severe deformations. Poselets, proposed
by Bourdev et al., is another part-based model demonstrating
competitive performance. Poselets has numerous part-wise HOG
detectors covering wide pose variations. Activated parts vote for
the location of objects.

In recent years, CNN-based approaches have led to the successful
development of object detection, following the drastic advances of
deep CNNs. The large-scale ImageNet database and the rise of
parallel computing have contributed to a breakthrough in detection
[24, 13, 12, 17, 21, 22] as well as classification.

The state-of-the-art method in object detection is the Region-CNN
(R-CNN), which represents a local region by CNN activations.
Specifically, the R-CNN framework proceeds as follows. First, it
extracts local regions which probably contain an object by using an
object proposal method. The local regions, called object proposals,
are warped into a fixed size and fed to a pre-trained CNN. Then each
proposal is represented by mid-level CNN activations (e.g. 7th
convolutional layer) and evaluated by separate classifiers (e.g.
SVMs). The object proposals are then merged and fed to a bounding
box regressor to correct mislocalizations. Despite its efficiency
and successful performance, it has the limitation that proposal
quality highly affects detection performance. If the proposal model
fails to propose a suitable candidate, the remaining procedures have
no opportunity to detect it. For this reason, a new class-agnostic
proposal method with a CNN regression model was proposed to improve
the proposal quality while reducing the number of proposals. Also,
R-CNN is a cascaded model composed of individual components (object
proposal, feature extraction, classification, and bounding box
regression), so each of these must be individually engineered for
the best performance.

Apart from the R-CNN framework, there is another CNN-based
approach, which considers object detection as a regression problem.
Szegedy et al. [24] train a CNN which maps an image to a rectangular
mask of an object. Sermanet et al. also employ a similar approach,
but their CNN directly estimates bounding box coordinates. These
methods are free from object proposals, but it is still debatable
whether everything should be left to a CNN trained with a
mean-square cost to produce an exact bounding box.

Compared to the previous CNN methods, our method does not rely on
object proposals, since we actively explore objects by iterative
classifications. Also, the proposed network has a unified
classification architecture, which has been verified in many CNN
applications and does not require tuning of individual components.



Figure 2. A pipeline of our detection framework. AttentionNet is composed of two final layers for top-left (TL) and bottom-right (BR) of
the input image domain. Each of them outputs a direction (→ ↘ ↓ for TL, ← ↖ ↑ for BR) where each corner of the image should go
to for the next step, or a “stop” sign (•), or “non-human” sign (F). When AttentionNet outputs “non-human” in both layers, the image is
rejected. The image is cropped according to the weak directions and fed to AttentionNet again, until it meets “stop” in both layers.


3. Detection with AttentionNet 

We first describe how our detection framework operates under a
constrained condition where an image includes only a single target
instance. Extension of this detection system to multiple instances
is described in Sec. 4.

3.1. Overview 

We summarize the core algorithm of our detection framework in
Fig. 2. We warp an input image into a fixed CNN input size and feed
it to AttentionNet. AttentionNet outputs two directional predictions
corresponding to the top-left (TL) corner and the bottom-right (BR)
corner of the input image. For example, the possible predictions for
TL are the following: “go right (→)”, “go right-down (↘)”, “go down
(↓)”, “stop here (•)” and “no instance in this image (F)”. Let us
assume the prediction of AttentionNet indicates ↓_TL and ↖_BR, as
shown in Fig. 2. We then crop the input image along the
corresponding directional vectors of a fixed length l, and feed the
cropped image to AttentionNet again to get the next directions. This
procedure is repeated until the image meets one of the following
conditions: F in both corners, or • in both corners. If AttentionNet
returns F at both corners, the test ends with a final decision of
“no target instance in this image”. If AttentionNet returns • at
both corners, the test ends with the current image as a detection
result. This detected image can be back-projected to a corresponding
bounding box in the original input image domain. Given a stopped
(detected) bounding box b and corresponding output activations
y_TL, y_BR ∈ R^5, the detection score s_b is discriminatively
defined as,

\[
\begin{aligned}
s_b &= s_b^{TL} + s_b^{BR}, \quad \text{s.t.} \\
s_b^{TL} &= y_{TL}^{\bullet} - \left( y_{TL}^{\rightarrow} + y_{TL}^{\searrow} + y_{TL}^{\downarrow} + y_{TL}^{F} \right), \qquad (1) \\
s_b^{BR} &= y_{BR}^{\bullet} - \left( y_{BR}^{\leftarrow} + y_{BR}^{\nwarrow} + y_{BR}^{\uparrow} + y_{BR}^{F} \right).
\end{aligned}
\]
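The iterative procedure and the score of Eq. (1) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the decision names, the `predict` callback, and the toy predictor are hypothetical, the step is applied directly in original-image pixels (the paper warps each crop to the CNN input size first), and returning `None` when the iteration limit is reached is our own choice.

```python
from typing import Callable, Dict, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in original-image pixels
Scores = Dict[str, float]        # activation per decision; names are hypothetical

# Unit crop moves (dx, dy) for each quantized direction.
TL_MOVES = {"right": (1, 0), "right_down": (1, 1), "down": (0, 1)}
BR_MOVES = {"left": (-1, 0), "left_up": (-1, -1), "up": (0, -1)}


def stop_score(y_tl: Scores, y_br: Scores) -> float:
    """Eq. (1): per corner, the stop activation minus the other four activations."""
    s_tl = y_tl["stop"] - sum(v for k, v in y_tl.items() if k != "stop")
    s_br = y_br["stop"] - sum(v for k, v in y_br.items() if k != "stop")
    return s_tl + s_br


def detect(predict: Callable[[Box], Tuple[Scores, Scores]], width: int, height: int,
           step: int = 30, max_iters: int = 50) -> Optional[Tuple[Box, float]]:
    """Crop iteratively from the full image until both corners say stop or F."""
    x1, y1, x2, y2 = 0, 0, width, height
    for _ in range(max_iters):
        y_tl, y_br = predict((x1, y1, x2, y2))
        tl = max(y_tl, key=y_tl.get)
        br = max(y_br, key=y_br.get)
        if tl == "none" and br == "none":   # F at both corners: reject
            return None
        if tl == "stop" and br == "stop":   # stop at both corners: detection
            return (x1, y1, x2, y2), stop_score(y_tl, y_br)
        dx, dy = TL_MOVES.get(tl, (0, 0))   # a stopped corner does not move
        x1, y1 = x1 + step * dx, y1 + step * dy
        dx, dy = BR_MOVES.get(br, (0, 0))
        x2, y2 = x2 + step * dx, y2 + step * dy
        if x2 <= x1 or y2 <= y1:            # degenerate crop (assumption: reject)
            return None
    return None                             # iteration limit (assumption: reject)


def toy_predict(box: Box, target: Box = (60, 40, 180, 200), step: int = 30):
    """Hypothetical stand-in for AttentionNet: points each corner at `target`."""
    x1, y1, x2, y2 = box
    tx1, ty1, tx2, ty2 = target
    right, down = tx1 - x1 >= step, ty1 - y1 >= step
    left, up = x2 - tx2 >= step, y2 - ty2 >= step
    tl = "right_down" if right and down else "right" if right else "down" if down else "stop"
    br = "left_up" if left and up else "left" if left else "up" if up else "stop"
    y_tl = {k: 0.0 for k in ("right", "right_down", "down", "stop", "none")}
    y_br = {k: 0.0 for k in ("left", "left_up", "up", "stop", "none")}
    y_tl[tl], y_br[br] = 1.0, 1.0
    return y_tl, y_br


result = detect(toy_predict, 300, 300)
```

With the toy predictor, the crop stops within one step length of the target on every side, which illustrates how fixed-length quantized moves bound the final localization error.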


This detection framework has several benefits. Most object proposal
windows generated by traditional proposal methods truncate target
objects. However, previous methods that depend on these object
proposals grade the windows by SVM scores only, without an intrinsic
model to carefully handle the problem. A maximally scored object
proposal through SVM does not guarantee that it fits an entire
target object. This issue is discussed again in Sec. 3.4. In
contrast, starting from a boundary outside the target object, our
method reaches a terminal point with obvious stop signals in both
corners, like the examples in Fig. 1. Compared with previous methods
solving the detection problem by a CNN-based regression [19, 24],
our method adopts a more successful classification model with a
soft-max cost rather than a regression model with a mean-square
cost. Even if the model provides weak direction vectors which are
quantized and have a fixed short length, our prediction for object
detection becomes stronger as the bounding box is iteratively
narrowed down to a target object.

3.2. Network architecture 

The architecture of AttentionNet is illustrated in Fig. 2, but we
drop pooling and normalization layers from the figure for
simplicity. Layers from the first to the seventh convolution (also
called the seventh fully-connected layer) are the same layers as in
Chatfield et al.'s VGG-M network. In this network, the stride and
filter size are smaller at the first layer but the stride at the
second convolution layer is larger, compared with Krizhevsky et
al.'s network. This is also similar to the CNN suggested by Zeiler
and Fergus. We adopt this model due to its superior performance on
the ILSVRC classification (top-5 error of 16.1% with a single model)
without a significant increase in training time. Please refer to
Chatfield et al. for more details of this architecture.

Figure 3. Real examples of crop-augmentation for training
AttentionNet. The target instance is the right man. Dashed cyan
bounding boxes are ground-truths, and the blue bounding boxes are
the augmented regions. Red arrows/dots denote their ground truths.


Our detection framework requires two predictions, for TL and BR,
but we prefer a classification CNN trained with soft-max loss. While
regression with a mean-square loss forces the model to predict the
exact coordinates of a bounding box, which is difficult,
classification of quantized directions with soft-max loss only
forces the model to be maximally activated at the true direction. We
therefore separate the output layers of TL and BR, because this
setting enables us to use the soft-max loss, with each branch
returning its own prediction. These final layers are denoted by
Conv8-TL and Conv8-BR in Fig. 2. Confidences of the 5 decisions at
TL (BR) are computed with 5 corresponding filters of 1×1×4,096 size
at Conv8-TL (Conv8-BR). These two layers are followed by the final
ReLU layer.
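As a minimal sketch of the two output branches, a 1×1 filter applied to a single spatial position reduces to a dot product with the feature vector, so each branch is five dot products followed by a ReLU. The function and variable names are hypothetical, the weights are random, and a small feature dimension replaces the 4,096 used in the paper to keep the example quick.

```python
import math
import random
from typing import List, Tuple


def conv1x1(feature: List[float], filters: List[List[float]]) -> List[float]:
    """Apply 1x1 filters at one spatial position: one dot product per filter."""
    return [sum(w * x for w, x in zip(f, feature)) for f in filters]


def relu(values: List[float]) -> List[float]:
    return [max(0.0, v) for v in values]


def softmax(values: List[float]) -> List[float]:
    """Numerically stable soft-max, as used inside the training loss."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(feature: List[float], filters_tl: List[List[float]],
                   filters_br: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Two separate branches (Conv8-TL, Conv8-BR), each followed by a ReLU."""
    return relu(conv1x1(feature, filters_tl)), relu(conv1x1(feature, filters_br))


random.seed(0)
DIM = 64  # stands in for the 4,096-dimensional conv7 output
feature = [random.gauss(0.0, 1.0) for _ in range(DIM)]
filters_tl = [[random.gauss(0.0, 0.1) for _ in range(DIM)] for _ in range(5)]
filters_br = [[random.gauss(0.0, 0.1) for _ in range(DIM)] for _ in range(5)]
y_tl, y_br = attention_head(feature, filters_tl, filters_br)
p_tl = softmax(y_tl)  # probabilities over the 5 TL decisions
```

Keeping the branches separate means each soft-max normalizes over only its own five decisions, so the TL and BR predictions do not compete with each other.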

3.3. Training 

To make AttentionNet operate in the scenario we devise, it is quite
important to process the original training images into a suitable
form. During the test stage, starting from an initial test over the
entire image boundary to a final decision of “stop” or “no
instance”, the number of possible decision pairs is 17 (= 4×4 + 1):
the 16 combinations of the four TL decisions (→, ↘, ↓, •) with the
four BR decisions (←, ↖, ↑, •) for positive regions, and
{F_TL, F_BR} for negative regions. We therefore must augment the
original training images into a reformed training set evenly
covering these 17 cases. Fig. 3 shows real examples of how we
process an original training image into multiple augmented regions.
We randomly generate positive regions which satisfy the following
three rules.

1. A positive region must include at least 50% of the area 
of a target instance. 


2. A positive region can include multiple instances (as the
top-left example in Fig. 3), but the target instance must
occupy the biggest area. Within a cropped region, the
area of the target instance must be at least 1.5-times
larger than that of the other instances.

3. Regions are cropped in varying aspect ratios as well as 
varying scales. 

The second rule is important for complex instance layouts in the
multiple instance scenario (to be introduced in Sec. 4). Without
this rule, a final bounding box is prone to fit multiple instances
at once. In order to make AttentionNet always narrow the bounding
box down to the largest instance among multiple instances, we must
follow the second rule in generating positive regions. The third
rule is also necessary because aspect ratio and scale change during
the iterative crops in the test stage. We extract negative regions
which do not overlap with bounding boxes of the target class, or
which overlap with bounding boxes of non-target classes. The
negative regions also have varying aspect ratios and scales.
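The first two rules for positive regions can be checked with a few lines of box arithmetic. This is a sketch only: the function names are hypothetical, and we assume that the "area" of an instance in rule 2 means its visible area inside the cropped region, which the text does not state explicitly.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def area(box: Box) -> float:
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def intersection(a: Box, b: Box) -> Box:
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def is_valid_positive(region: Box, target: Box, others: Sequence[Box],
                      min_cover: float = 0.5, dominance: float = 1.5) -> bool:
    """Rule 1: the region covers at least 50% of the target's area.
    Rule 2: inside the region, the target is at least 1.5x larger than any other
    instance (visible areas are compared; this reading is an assumption)."""
    target_inside = area(intersection(region, target))
    if target_inside < min_cover * area(target):
        return False
    return all(target_inside >= dominance * area(intersection(region, other))
               for other in others)


region = (0, 0, 100, 100)
target = (10, 10, 60, 90)           # 4,000 px, fully inside the region
small_neighbor = (70, 10, 100, 90)  # 2,400 px inside the region
large_neighbor = (55, 0, 100, 100)  # 4,500 px inside the region
ok = is_valid_positive(region, target, [small_neighbor])
rejected = is_valid_positive(region, target, [large_neighbor])
```

Rule 3 (varying scales and aspect ratios) concerns how candidate regions are sampled rather than whether one is valid, so it is left out of the check.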

When we compose a batch to train the CNN, we select positive and
negative regions in equal proportion. In a batch, each of the 16
(= 4×4) cases for positive regions occupies a portion of 1/(2×16),
and the negative regions occupy the remaining portion of 1/2. The
loss for training AttentionNet is the average of the two soft-max
losses computed independently in TL and BR.
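The batch composition and the loss can be written down directly. The sketch below assumes a batch size divisible by 32 and uses hypothetical decision names; it illustrates the arithmetic, not the authors' training code.

```python
import math
from typing import Dict, List, Tuple

TL_POSITIVE = ["right", "right_down", "down", "stop"]
BR_POSITIVE = ["left", "left_up", "up", "stop"]


def batch_quota(batch_size: int) -> Dict[Tuple[str, str], int]:
    """Samples per case: 1/(2*16) of the batch for each positive pair, 1/2 negatives."""
    if batch_size % 32 != 0:
        raise ValueError("batch_size is assumed to be a multiple of 32")
    per_case = batch_size // 32
    quota = {(tl, br): per_case for tl in TL_POSITIVE for br in BR_POSITIVE}
    quota[("none", "none")] = batch_size // 2
    return quota


def softmax_loss(logits: List[float], target_index: int) -> float:
    """Cross-entropy of a soft-max over `logits` (log-sum-exp form)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target_index]


def attention_loss(logits_tl: List[float], target_tl: int,
                   logits_br: List[float], target_br: int) -> float:
    """Average of the two independent soft-max losses (TL and BR)."""
    return 0.5 * (softmax_loss(logits_tl, target_tl) + softmax_loss(logits_br, target_br))


quota = batch_quota(128)
uniform_loss = attention_loss([0.0] * 5, 3, [0.0] * 5, 3)
```

For a batch of 128, each of the 16 positive cases receives 4 samples and the negatives receive 64; with uniform logits the loss equals log 5 for each branch.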

3.4. Verification of AttentionNet 

Before we extend AttentionNet to multiple instances, we verify the
effectiveness of our top-down approach against the object proposal
[26] based framework. As studied by Agrawal et al., strong mid-level
activations in a CNN come from object “parts” that are distinctive
from other object classes. Because Region-CNN based detection relies
on each region score coming from the CNN activations, it is prone to
focus on a discriminative object “part” (e.g. face) rather than the
“entire object” (e.g. entire body).

To analyze this issue, we conduct a toy experiment on human
detection. As an object proposal based method, we represent object
proposals by activations from a fine-tuned CNN and score them by an
SVM. Then, we choose the bounding box which has the maximum SVM
score as the detection result. We compare this setting with our
AttentionNet framework in Fig. 5. The Region-CNN, SVM, and
AttentionNet are trained with the same data, including ILSVRC’12
classification and PASCAL VOC 2007. The test images, which contain a
single human at a reasonable scale, are selected from the PASCAL VOC
2007 test set. Then, we compute the average precision.

In this experiment, the object proposal based setting 
shows 79.4% while AttentionNet shows 89.5%. The object 
[Figure 4 panels: activation maps; prediction maps (regions kept only when TL = ↘ and BR = ↖); single instance per image.]


Figure 4. Extracting single-instance regions, where only a single instance is included. Multiple inputs with multiple scales/aspects are fed to
AttentionNet, and prediction maps are produced. Only image regions satisfying {↘_TL, ↖_BR} are regarded as the single-instance regions.
These regions are fed to AttentionNet again for final detection. Note, the CNN here and that in Fig. 2 are the same one, not separated.



Figure 5. Real detection examples of the object proposal based 
method (left) and AttentionNet (right). In the left column, a red 
bounding box is the top-1 detected region among top-10 object 
proposals (cyan) with the maximum SVM score. 


proposal based method shows much lower detection performance
because of the relatively weak correlation between strong activation
and the “entire” human body. In contrast, AttentionNet reliably
reaches a terminal point starting from a boundary outside the target
object. As shown in Fig. 5, the maximally scored object proposal is
prone to focus on discriminative faces rather than the “entire”
human body.

4. Extension to Multiple Instances 

AttentionNet is trained to detect a single instance in an image and
operates in this way. In this section, we introduce an efficient
method to extend our detection framework to a practical situation
where an image involves multiple instances. Our straightforward
solution is to propose candidate regions where only a single
instance is included. We reuse AttentionNet for the single-instance
region proposal, so no separate model is necessary. We then detect
each instance from each region proposal, and merge the results into
a reduced number of bounding boxes, followed by a final refinement
procedure in which AttentionNet is reused once more.

4.1. Efficient single instance region proposal 

Let us assume we have an arbitrary region in an image. If we feed
this region to AttentionNet, 17 decision combinations in TL/BR are
possible. Among them, only the output of {↘_TL, ↖_BR} guarantees
that the region includes the entire body of a target instance with
proper margins. In the other decision combinations, the region is
probably truncating a target instance or does not contain an
instance. This is the logic we use to determine whether an entire
single instance is included or not.

To boost the recall of the single-instance region proposals, we
place our framework in the sliding window paradigm, but we do not
naively crop and feed each sliding window to AttentionNet. Following
a technique successfully used in prior work, we utilize the property
that a CNN does not strictly require a fixed-size input, because a
fully connected layer is equivalent to a convolution layer composed
of filters of 1×1 size. For example, if an image 2-times larger in
area (321×321) than a regular CNN input (227×227) is fed to
AttentionNet, a spatial activation map of 3×3×5 size is produced
from an output layer, as shown in Fig. 4. Here, the 9 activation
vectors (∈ R^5) correspond to 9 local patches of 227×227 size with a
stride of 32. In this way, we can speed up the sliding window method
with AttentionNet.
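The 3×3 map in this example follows from simple window arithmetic. The sketch below uses an idealized formula (window 227, stride 32, no padding); the exact map size of a real network depends on the padding and pooling details of every layer, so treat it as an approximation.

```python
from typing import List


def output_map_size(input_size: int, window: int = 227, stride: int = 32) -> int:
    """Number of sliding-window positions along one axis (idealized, no padding)."""
    if input_size < window:
        raise ValueError("input must be at least as large as the window")
    return (input_size - window) // stride + 1


def window_origins(input_size: int, window: int = 227, stride: int = 32) -> List[int]:
    """Top-left coordinate of each implicit window along one axis."""
    return [i * stride for i in range(output_map_size(input_size, window, stride))]


size = output_map_size(321)    # positions per axis for a 321x321 input
origins = window_origins(321)  # where each 227-pixel window starts
```

A single forward pass over the 321×321 image thus replaces nine separate passes over 227×227 crops, and the saving grows with the image size.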

(a) Initial detections. (b) Initial merge. (c) Re-initialize (×2.5). (d) Re-detections. (e) Final merge.


Figure 6. Real examples of our detection procedure, including initial results (a~b) and refinement (c~e). Initially detected candidates
come from Fig. 4 followed by Fig. 2, and are merged by an intersection over union (IoU) of 0.8. We extend each merged box to 2.5-times larger
size, and feed them to Fig. 2 again. Finally we merge the second results by an IoU of 0.5.


Since instances in images can be diverse in aspect ratio and scale,
multiple scales/aspects are also required for the sliding windows.
We therefore feed multi-scale/aspect images to AttentionNet, and
obtain predictions for thousands of sliding windows, as depicted in
Fig. 4. Given the prediction maps, only the regions whose prediction
is {↘_TL, ↖_BR} are cropped and fed to AttentionNet again for the
final detection. The number of scales/aspects can be
deterministically set according to the size/aspect statistics of
ground-truth bounding boxes in a training set.
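Selecting single-instance regions from a pair of prediction maps amounts to a scan for cells where TL says ↘ and BR says ↖. In the sketch below the maps are lists of decision names (hypothetical labels), and `scale_x`/`scale_y` are the assumed resize factors that were applied to the original image before the forward pass.

```python
from typing import List, Tuple

Region = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in original-image pixels


def single_instance_regions(tl_map: List[List[str]], br_map: List[List[str]],
                            scale_x: float = 1.0, scale_y: float = 1.0,
                            window: int = 227, stride: int = 32) -> List[Region]:
    """Keep only windows predicted as {right_down at TL, left_up at BR}."""
    regions = []
    for row, (tl_row, br_row) in enumerate(zip(tl_map, br_map)):
        for col, (tl, br) in enumerate(zip(tl_row, br_row)):
            if tl == "right_down" and br == "left_up":
                x1, y1 = col * stride, row * stride
                # Map the window from the resized image back to the original image.
                regions.append((x1 / scale_x, y1 / scale_y,
                                (x1 + window) / scale_x, (y1 + window) / scale_y))
    return regions


tl_map = [["right", "right", "stop"],
          ["down", "right_down", "right_down"],
          ["none", "none", "none"]]
br_map = [["left", "left", "left"],
          ["up", "stop", "left_up"],
          ["none", "none", "none"]]
regions = single_instance_regions(tl_map, br_map)
```

Running the same scan on the maps of every scale and aspect ratio, each with its own resize factors, yields the full set of candidate regions.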

4.2. Initial detection and final refinement 

Each single-instance region proposal produced in Sec. 4.1 is fed to
AttentionNet iteratively until it meets {•_TL, •_BR} or
{F_TL, F_BR}. The first image in Fig. 6 shows a real example of the
initial detections. These bounding boxes are merged into a smaller
number by single-linkage clustering: a group of bounding boxes
satisfying a minimum intersection over union (IoU) of α_0 is
averaged into one box, together with their scores of Eq. (1).
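Single-linkage merging can be sketched with a small union-find: two boxes are linked whenever their IoU reaches the threshold, and each connected group is averaged. The plain (unweighted) averaging of coordinates and scores is our assumption, since the text does not specify a weighting, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def merge_boxes(boxes: Sequence[Box], scores: Sequence[float],
                min_iou: float) -> List[Tuple[Box, float]]:
    """Single-linkage clustering: link boxes with IoU >= min_iou, average each group."""
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if iou(boxes[i], boxes[j]) >= min_iou:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(i)

    merged = []
    for members in groups.values():
        n = len(members)
        box = tuple(sum(boxes[m][d] for m in members) / n for d in range(4))
        merged.append((box, sum(scores[m] for m in members) / n))
    return merged


boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (50, 50, 60, 60)]
merged = merge_boxes(boxes, [1.0, 3.0, 2.5], min_iou=0.8)
```

Because linkage is transitive, a chain of boxes can end up in one group even when its two ends overlap by less than the threshold, which is the defining property of single-linkage clustering.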

To refine the result, [11, 12] employ a bounding box regression,
which finally re-localizes the bounding boxes. This is a linear
regression model which maps a given feature of a bounding box to a
new detection window.

In our case, we can employ AttentionNet again in the role of
bounding box regression for further refinement. We re-scale each
bounding box in Fig. 6(b) to a new initial window in Fig. 6(c) by a
re-scaling factor of β. These re-initialized windows are fed to
AttentionNet again and result in new bounding boxes, as shown in
Fig. 6(d). This re-detection procedure gives us one more chance to
reject false positives as well as to achieve fine localization.
These bounding boxes are finally merged into the final results by an
IoU of α_1.
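The re-initialization step can be sketched as a box enlargement. We assume here that β scales the side lengths about the box center and that the result is clipped to the image; the text does not say whether the factor applies to side length or area, nor how image borders are handled, so both choices are assumptions, as is the function name.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def rescale_box(box: Box, beta: float, width: float, height: float) -> Box:
    """Enlarge `box` by factor `beta` about its center, clipped to the image."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w, half_h = (x2 - x1) * beta / 2.0, (y2 - y1) * beta / 2.0
    return (max(0.0, cx - half_w), max(0.0, cy - half_h),
            min(float(width), cx + half_w), min(float(height), cy + half_h))


window = rescale_box((100, 100, 140, 180), beta=2.5, width=500, height=400)
clipped = rescale_box((0, 0, 40, 40), beta=2.5, width=500, height=400)
```

Starting the second pass from an enlarged window gives AttentionNet a boundary outside the object again, which is the situation it was trained to narrow down from.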

5. Evaluation 

We perform the object detection task on public datasets to verify
the strength of AttentionNet. We primarily apply our detection
framework to a human detection problem and extend it to a non-human
class.

Among a wide range of object classes, it is beyond question that
the class “human” has taken the center stage in object detection for
decades because of its wide and attractive applications.
Nonetheless, human detection on uncontrolled natural images is still
challenging due to severe deformations caused by pose variation,
occlusion, and overlapped layout. For rigorous verification in such
challenging settings, we choose human as our primary target class.

Datasets. We select PASCAL VOC 2007 and 2012, where images are
completely uncontrolled because they are composed of arbitrary web
images from Flickr. Humans in these sets are severely occluded,
truncated and overlapped, with diverse pose variations as well as
scales. Following the standard setting in previous human detection
work on these sets, we use all the “trainval” images for training,
and report the average precision (AP). For PASCAL VOC 2007, we use
the evaluation function provided by the development toolkit. For
PASCAL VOC 2012, we submit our result to the evaluation server and
obtain the AP value.

Training. Our framework requires only one training stage for
AttentionNet. The selected datasets provide a relatively small
number of training images, while AttentionNet is composed of many
weights to be optimized. Thus, we pre-train the model with the
ILSVRC 2012 classification dataset and transfer the model to our
target datasets. As in prior work, the learning rate of the
pre-trained layers (e.g. 0.001 for conv1 to conv7) is 10 times
smaller than that of the re-initialized layers (e.g. 0.01 for
conv8-TL and conv8-BR).

Parameters in test stage. Our method requires several parameters
to be determined before the test stage. We tuned all these
parameters over the validation set of PASCAL VOC 2007. After we
determine the proper parameter values, we

Method                            Extra data     VOC’07   VOC’12
AttentionNet                      ImNet          61.7     62.8
AttentionNet + Refine             ImNet          65.0     65.6
AttentionNet + R-CNN              ImNet          66.4     69.0
AttentionNet + Refine + R-CNN     ImNet          69.8     72.0
Person R-CNN + BBReg              ImNet          59.7     N/A
Person R-CNN + BBReg × 2          ImNet          59.8     N/A
Person R-CNN + BBReg × 3          ImNet          59.7     N/A
Felzenszwalb et al. ’10           None           41.9     N/A
Bourdev et al.                    H3D            46.9     N/A
Szegedy et al. ’13 [24]           VOC’12         26.2     N/A
Erhan et al.                      None           37.5     N/A
Gkioxari et al. ’14               VOC’12         45.6     N/A
Bourdev et al.                    ImNet + H3D    59.3     58.7
He et al. ’14                     ImNet          57.6     N/A
Girshick et al. ’14               ImNet          58.7     57.8
Girshick et al. ’14               ImNet          64.2*    N/A
Shen and Xue ’14                  ImNet          59.1     60.2


*A very deep model of 16 convolution layers is used.


Table 1. Human detection performance on PASCAL VOC 
2007/2012. ImNet denotes ILSVRC 2012 classification set. 
AP(%) is reported. 


apply the same parameters to all experiments in this paper,
regardless of dataset. We use the following parameters. We set the
length l of each direction vector to 30 pixels, and we limit the
maximum number of iterative feed-forwards to 50 to prevent the
possibility of divergence. To increase the chance of detecting large
instances (e.g. an image fully filled with a face), we first re-size
the average image to (2h × 2w)-size and then place an input image of
(h × w)-size at the center of the magnified average image before
feed-forwarding. We use 7 scales with a scale step of 2, and 3
aspect ratios of {1.0, 1.5, 2.0}, according to the statistics of
ground-truth bounding boxes in the training set. We set the merging
parameters α_0 and α_1 to 0.8 and 0.5, respectively. We set the
re-scaling factor β in Fig. 6(c) to 2.5. When we do not perform the
refinement step of Fig. 6(c~e), we set the initial merging parameter
α_0 to 0.6.

We first evaluate the performance of human detection on PASCAL VOC
2007/2012. We compare our method against the recent state-of-the-art
methods, and Table 1 shows the result. Without the refinement step
of Fig. 6(c~e), our method achieves 61.7% and 62.8% on the two
datasets, respectively. When we add the refinement step, we achieve
new state-of-the-art scores of 65.0% and 65.6%, an additional
improvement of +3.3% and +2.8%. For the refinement step, we simply
re-use AttentionNet for re-localization, so we do not need an extra
model.

Similar to ours, Poselets-based methods are also limited to
single-class (e.g. human) object detection. Our method outperforms
these methods by a large margin. Though we do not include an
intrinsic human



(a) Person class: R-CNN (58.7%), AttentionNet (61.7%), AttentionNet+Refine (65.0%).
(b) Bottle class: R-CNN (36.8%), AttentionNet (39.2%), AttentionNet+Refine (41.7%).
Figure 7. Precision-recall on PASCAL VOC 2007.


model to handle diverse poses and severe occlusions, our framework
successfully converges to the window fitting the human from a window
coarsely covering an entire human body. Our top-down approach is
robust to diverse human poses (Fig.) because our model is trained to
operate in that way with carefully augmented training samples.

Among detection-by-regression methods, the works most similar to
ours train CNN-regression models with a mean square error to produce
a target object mask, or bounding box coordinates [9] for the
purpose of class-agnostic object proposals. Our AttentionNet clearly
outperforms these methods, which verifies the strength of the
ensemble of weak directions from a promising CNN-classification
model.

R-CNN has been the best performing detection framework and is used
by other methods showing state-of-the-art performance, but our
method also beats these methods, which use an 8-layered
architecture. Our result is even slightly better (+0.8%) than that
of R-CNN equipped with a 16-layered very deep network, which yields
a top-5 error of 7.0% in ILSVRC’12 classification. However, one
obvious strength of R-CNN is class-scalability while yielding high
performance. Extension to multiple classes with a single
AttentionNet is our primary direction.

To make the comparison completely fair, we trained R-CNN in the
setting of “person”-versus-all by using the R-CNN code¹ provided by
the authors. It is denoted by “Person R-CNN + BBReg”. Our method
still shows better results, demonstrating a significant margin of
+5.3% on PASCAL VOC 2007. We also tried to iterate the bounding-box
regression in R-CNN. It is denoted by “BBReg × N”. The improvement
is negligible: +0.1% and +0.0% for the second and third iterations.
These results imply the benefit of our stacked classification
strategy over stacked regression.

Fig. 7(a) shows the precision-recall curves of human detection on
PASCAL VOC 2007. We plot the curve of R-CNN by using the source code
from the authors. The curve clearly shows both the strength and the
weakness of AttentionNet. Compared with R-CNN, our method yields
much higher precision, but the tail is truncated early, which
implies lower recall than R-CNN. This tendency comes from our hard
decision strategy: AttentionNet 1) accepts a region as an initial
candidate only if both corners satisfy {↘_TL, ↖_BR} at the same
time, and 2) takes a final positive decision only if both corners
indicate “stop”. These hard criteria give us a significantly reduced
number of bounding boxes with quite strong confidence, but result in
weak recall. Our AP of 65.0% is obtained with only 4,863 boxes,
while R-CNN yields 58.7% with 53,624 boxes.

¹ https://github.com/rbgirshick/rcnn

Method                            Extra data     VOC’07   VOC’12
AttentionNet                      ImNet          39.2     41.2
AttentionNet + Refine             ImNet          41.7     42.5
AttentionNet + R-CNN              ImNet          42.9     43.2
AttentionNet + Refine + R-CNN     ImNet          45.5     45.0
Bottle R-CNN + BBReg              ImNet          32.4     N/A
Bottle R-CNN + BBReg × 2          ImNet          32.4     N/A
Bottle R-CNN + BBReg × 3          ImNet          32.5     N/A
He et al. ’14 [14]                ImNet          40.5     N/A
Girshick et al. ’14               ImNet          36.8     32.6
Girshick et al. ’14               ImNet          44.6*    N/A
Shen and Xue ’14                  ImNet          36.3     33.1

*A very deep model of 16 convolution layers is used.

Table 2. Bottle detection performance on PASCAL VOC
2007/2012. AP(%) is reported.

We can draw a positive conclusion from this observation: detection
results from the two different approaches are complementary. We
therefore combine the two results as follows. Because our bounding
boxes are more confident, we reject R-CNN bounding boxes which
overlap ours by more than an IoU of 0.7, and add a bias to our
bounding box scores so that ours are ranked first, followed by the
remaining R-CNN bounding boxes. As reported in Table 1, we achieve a
significantly boosted performance of 69.8% from the combination of
AttentionNet and R-CNN.
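The combination rule can be sketched as follows. The size of the bias is not given in the text; here we assume it is chosen just large enough that every one of our boxes outranks every remaining R-CNN box, with a hypothetical `margin` of 1.0. Function names are also hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Detection = Tuple[Box, float]            # (box, score)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def combine(ours: Sequence[Detection], rcnn: Sequence[Detection],
            max_iou: float = 0.7, margin: float = 1.0) -> List[Detection]:
    """Drop R-CNN boxes overlapping ours by more than `max_iou`, then rank ours first."""
    kept = [(box, score) for box, score in rcnn
            if all(iou(box, our_box) <= max_iou for our_box, _ in ours)]
    bias = 0.0
    if ours and kept:
        # Assumed bias: lift our lowest score above the highest remaining R-CNN score.
        bias = max(s for _, s in kept) - min(s for _, s in ours) + margin
    combined = [(box, score + bias) for box, score in ours] + kept
    return sorted(combined, key=lambda det: det[1], reverse=True)


ours = [((0, 0, 10, 10), 0.2)]
rcnn = [((0, 0, 10, 11), 5.0), ((50, 50, 60, 60), 3.0)]
combined = combine(ours, rcnn)
```

In the example, the first R-CNN box overlaps ours with an IoU of about 0.91 and is rejected, while the distant one is kept and ranked after our detection.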

A single AttentionNet can currently handle only one object class;
however, its application is not limited to specific classes, since
it does not include any crafted class-specific model. We therefore
demonstrate performance on another object class. Among the 20
classes in the PASCAL VOC series, “bottle” is one of the most
challenging object classes. Bottle is the smallest class, occupying
only 8k pixels on average, while the other classes occupy 38k pixels
on average. Its appearance is also indistinct due to its transparent
material. Thus, we select “bottle” and verify our method on this
challenging case. We use exactly the same parameters as in human
detection, without any tuning.


Table 2 shows the result of bottle detection on PASCAL VOC
2007/2012. Overall, we observe a tendency similar to human
detection. Except for R-CNN equipped with a very large CNN, our
method yields the best scores on both datasets. Specifically,
AttentionNet with refinement beats the previous methods by gaps of
more than +9.4% on VOC 2012. Our method combined with the
complementary R-CNN gives further performance improvements.

Fig. 7(b) shows the precision-recall curves of bottle detection on
PASCAL VOC 2007. Although R-CNN shows slightly better precision at
low recall, its precision decreases steeply, while our precision
decreases with a gentler slope. Our curve is truncated early, with a
lower recall than R-CNN. Further positive mining and bootstrapping
could be possible solutions to boost recall.

6. Conclusions 

In this paper, we have proposed a novel method for object
detection. We adapted a well-studied classification technique to
object detection and presented AttentionNet to predict weak
directions towards a target object. Since we actively explore an
exact bounding box of a target object in a top-down approach, we do
not suffer from the quality of initial object proposals and provide
accurate object localization.

Through this work, we make two important observations. First, we
achieve new state-of-the-art performance by changing the way object
detection is performed. Second, our top-down approach is
complementary to the previous state-of-the-art method using a
bottom-up approach, so combining the two approaches boosts the
performance of object detection.

Limitations and future work. We have two limitations in this
study. First, our AttentionNet is not yet scalable to multiple
classes. However, AttentionNet has potential for extension to
generic object classes, because this model does not include any
crafted class-specific models. The other limitation is low recall,
which is caused by our hard decision strategy, as stated above. We
believe there is room for boosting recall. For example, we can
loosen the hard decision criteria by applying a thresholding
strategy to the direction scores from AttentionNet. Positive mining
and bootstrapping are also promising candidate solutions. We leave
these issues as future work.